Abstract

We consider a problem of estimating a sparse group of sparse normal mean vectors. The
proposed approach is based on penalized likelihood estimation with complexity penalties on
the number of nonzero mean vectors and the numbers of their "significant" components, and
can be performed by a computationally fast algorithm. The resulting estimators are developed
within Bayesian framework and can be viewed as MAP estimators. We establish their adaptive 
minimaxity over a wide range of sparse and dense settings. The presented short simulation 
study demonstrates the efficiency of the proposed approach that successfully competes with the 
recently developed sparse group lasso estimator. 



Keywords: Adaptive minimaxity; complexity penalty; maximum a posteriori rule; sparsity; 
thresholding. 

1 Introduction 

Suppose we observe a series of $m$ independent $n$-dimensional Gaussian vectors $y_1, \ldots, y_m$ with
independent components and common variance:

$$y_j = \mu_j + \epsilon_j, \quad \epsilon_j \sim N_n(0, \sigma_n^2 I_n), \quad j = 1, \ldots, m \qquad (1)$$






The variance $\sigma_n^2 > 0$, which may depend on $n$, is assumed to be known, and the goal is to estimate
the unknown mean vectors $\mu_1, \ldots, \mu_m$.

The key extra assumption on the model (1) is both within- and between-vectors sparsity
(hereafter within- and between-sparsity for brevity). More specifically, we assume that part of
$\mu_j$'s are identically zero vectors and the entire information in the noisy data is contained only in
a small fraction of them (between-sparsity). Moreover, even within nonzero $\mu_j$'s, most of their
components are still zeroes or at least "negligible" (within-sparsity). Formally, the within-sparsity
can be quantified in terms of $l_0$, strong or weak $l_p$-balls introduced further. Neither the indices of
nonzero $\mu_j$'s nor the locations of their "significant" components are known in advance.

Such a model appears in a variety of statistical applications, as we illustrate by the following
two examples.

Example 1. Additive models. Consider a nonparametric regression model
$y_i = f(x_{1i}, \ldots, x_{mi}) + \epsilon_i$, $i = 1, \ldots, n$, where $f : \mathbb{R}^m \to \mathbb{R}$ is the unknown
regression function assumed to belong to some class of functions (e.g., Hölder, Sobolev or Besov
classes), and $\epsilon_i \sim N(0, \sigma^2)$. Estimating $f$ in
such a general setup suffers from a severe "curse of dimensionality", where typically the sample
size $n$ should grow exponentially with the dimensionality $m$ to achieve consistent estimation. It is
essential then to place some extra restrictions on the complexity of $f$. One of the most common
approaches is to consider the additive models, where $f(x_1, \ldots, x_m) = f_1(x_1) + \ldots + f_m(x_m)$ and
each component $f_j$ lies in some smoothness class. In addition, similar to sparse linear regression
models, it is often reasonable to assume that only part of the predictors among $x_1, \ldots, x_m$ are really
"significant", while the impact of others is negligible if at all. Such sparse additive models are
especially relevant for $m \sim n$ and $m \gg n$ setups and have been considered in Lin & Zhang (2006),
Meier, van de Geer & Bühlmann (2009), Ravikumar et al. (2009), Raskutti, Wainwright & Yu
(2012).

Expand each $f_j$, $j = 1, \ldots, m$ into (univariate) orthonormal series $\{\psi_{ij}\}$ as
$f_j(x_j) = \sum_i \mu_{ij} \psi_{ij}(x_j)$, where $\mu_{ij} = \int f_j(x_j) \psi_{ij}(x_j) dx_j$. The original
nonparametric additive model is then transformed into the
equivalent problem of estimating vectors of corresponding coefficients $\mu_1, \ldots, \mu_m$ within Gaussian
noise (1), where for sparse additive models, most of $\mu_j$ are zeroes (between-sparsity). Moreover, for
properly chosen bases $\{\psi_{ij}\}$ (e.g., Fourier series for Sobolev or wavelets for more general Besov
classes), the nonzero $\mu_j$ will be also sparse (within-sparsity).

Example 2. Time-course microarray experiments. In time-course microarray experiments the data
consists of measurements of differences in the expression levels between "treated" and "control"
samples of $m$ genes recorded at different times. A record on the $j$-th gene at time point $t_i$ is modelled as
a measurement of an (unknown) expression profile function $f_j(t)$ at time $t_i$ corrupted by Gaussian
noise. The expression of most genes is the same in both groups ($f_j \equiv 0$) and the goal is to
identify the differentially expressed genes and estimate the corresponding non-identically zero
expression profile functions $f_j$. Similar to the previous example, each $f_j$ is commonly expanded
into some "parsimonious" orthonormal basis (e.g., Legendre polynomials, Fourier or wavelets) as
$f_j(t) = \sum_i \mu_{ij} \psi_{ij}(t)$ and in the coefficients domain the original functional model becomes

$$y_{ij} = \mu_{ij} + z_{ij}, \quad j = 1, \ldots, m; \; i = 1, \ldots, n,$$

where $y_{ij}$ are empirical coefficients of the data on the $j$-th gene and $z_{ij}$ are Gaussian noise (see, e.g.,
Angelini et al., 2007). For most genes, $\mu_j = 0$ (between-sparsity), while due to the parsimony of
the chosen basis, for differentially expressed genes, $\mu_j$ will still have a sparse representation
(within-sparsity).

To estimate $\mu_1, \ldots, \mu_m$ in (1) under the assumptions of between- and within-sparsity we proceed
as follows. From a series of pioneering works of Donoho & Johnstone in the nineties (e.g., Donoho
& Johnstone, 1994ab), it is well-known that the optimal strategy for estimating a single sparse
vector $\mu_j$ from $y_j$ is thresholding. Various threshold estimators $\hat{\mu}_j$ can be considered as penalized
likelihood estimators, where

$$\hat{\mu}_j = \arg\min_{\tilde{\mu}_j \in \mathbb{R}^n} \|y_j - \tilde{\mu}_j\|_2^2 + Pen_j(\tilde{\mu}_j),$$

corresponding to different choices of penalties $Pen_j(\tilde{\mu}_j)$. In particular, the $l_1$-type penalty
$Pen_j(\mu_j) = \lambda \|\mu_j\|_1$ leads to soft thresholding of components of $y_j$ with a constant threshold
$\lambda/2$ that coincides with the lasso estimator of Tibshirani (1996). Wider classes of penalties on
the magnitudes of components $\mu_{ij}$ are discussed in Antoniadis & Fan (2001). In this paper we
consider the $l_0$ or complexity type penalties $Pen_j(\|\mu_j\|_0)$ on the number of nonzero components
$\mu_{ij}$, where $\|\mu_j\|_0 = \#\{i : \mu_{ij} \neq 0\}$, that yield hard thresholding rules. In the simplest case,
where $Pen_j(\|\mu_j\|_0) = \lambda \|\mu_j\|_0$, the resulting (constant) threshold is $\sqrt{\lambda}$. More general complexity
penalties were studied in Birgé & Massart (2001), Abramovich, Grinshtein & Pensky (2007),
Abramovich et al. (2010) and Wu & Zhou (2012).
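The two simplest penalties can be compared component by component. The following is a minimal sketch, not the authors' code: the function names `soft_threshold` and `hard_threshold` are illustrative, and the thresholds follow from minimizing $(y_i - \mu_i)^2$ plus the respective penalty for each component separately.

```python
import math

def soft_threshold(y, lam):
    """l1 penalty lam*|mu_i|: each component is shrunk towards zero by lam/2."""
    t = lam / 2.0
    return [math.copysign(max(abs(v) - t, 0.0), v) for v in y]

def hard_threshold(y, lam):
    """l0 penalty lam*1{mu_i != 0}: keep y_i unchanged iff y_i^2 > lam."""
    t = math.sqrt(lam)
    return [v if abs(v) > t else 0.0 for v in y]

y = [3.0, -0.5, 1.2, -2.5]
soft = soft_threshold(y, 2.0)   # threshold lam/2 = 1.0
hard = hard_threshold(y, 4.0)   # threshold sqrt(lam) = 2.0
```

Soft thresholding shrinks every surviving component, whereas hard thresholding leaves survivors untouched, which is the "keep-or-kill" behaviour of complexity penalties.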

Penalizing each $\mu_j$ separately, however, essentially ignores the between-sparsity, where it is
assumed that most of $\mu_j$ are identically zeroes and should obviously be estimated by $\hat{\mu}_j = 0$.
Thus, simultaneous estimation of all $m$ mean vectors in (1) should involve an additional penalty
$Pen_0(\cdot)$ on the number of nonzero $\hat{\mu}_j$'s that are now defined as solutions of the following criterion:

$$\min_{\tilde{\mu}_1, \ldots, \tilde{\mu}_m \in \mathbb{R}^n} \left\{ \sum_{j=1}^m \left\{ \|y_j - \tilde{\mu}_j\|_2^2 + Pen_j(\|\tilde{\mu}_j\|_0) \right\} + Pen_0(k) \right\}, \qquad (2)$$

where $k = \#\{j : \tilde{\mu}_j \neq 0\}$. In this paper we investigate the optimality of such an approach
for estimating $\mu_1, \ldots, \mu_m$ under various within- and between-sparsity setups. In particular, we
specify the classes of complexity penalties $Pen_j(\|\tilde{\mu}_j\|_0)$ and $Pen_0(k)$ on respectively within- and
between-sparsity for which the resulting estimators $\hat{\mu}_1, \ldots, \hat{\mu}_m$ achieve asymptotically minimax rates
simultaneously for a wide range of sparse and dense cases. Such types of penalties naturally arise
within a Bayesian model selection framework. In this sense, this paper extends the results of the
Bayesian MAP testimation approach developed in Abramovich, Grinshtein & Pensky (2007) and
Abramovich et al. (2010) for estimating a single normal mean vector to simultaneous estimation
of a group of $m$ vectors in the model (1).

It is interesting to compare the proposed complexity penalization (2) with lasso-type procedures.
Similar to $l_0$-type penalization, the vector-wise use of the original lasso of Tibshirani (1996) for
estimating each $\mu_j$ in (1) results in per-component (soft) thresholding of each $y_j$ that handles
within-sparsity but ignores between-sparsity. To address the latter, Yuan & Lin (2006) proposed a
group lasso that for the particular model (1) at hand solves

$$\min_{\tilde{\mu}_1, \ldots, \tilde{\mu}_m \in \mathbb{R}^n} \sum_{j=1}^m \left\{ \|y_j - \tilde{\mu}_j\|_2^2 + \lambda \|\tilde{\mu}_j\|_2 \right\}.$$

It can be easily shown that in such a setup, the group lasso estimator is available in closed form,
namely, $\hat{\mu}_j = \left(1 - \frac{\lambda}{2\|y_j\|_2}\right)_+ y_j$, $j = 1, \ldots, m$, which is the vector-level "shrink-or-kill" thresholding
with a threshold $\lambda/2$. The $\hat{\mu}_j$'s are, therefore, either entirely zero or do not have zero components at
all. As a result, the group lasso does not handle within-sparsity. To combine both types of sparsity,
Friedman, Hastie & Tibshirani (2010) introduced the sparse group lasso that for the model (1) is
defined as

$$\min_{\tilde{\mu}_1, \ldots, \tilde{\mu}_m \in \mathbb{R}^n} \sum_{j=1}^m \left\{ \|y_j - \tilde{\mu}_j\|_2^2 + \lambda_1 \|\tilde{\mu}_j\|_2 + \lambda_2 \|\tilde{\mu}_j\|_1 \right\} \qquad (3)$$

yielding $\hat{\mu}_j = \left(1 - \frac{\lambda_1}{2\|\tilde{y}_j\|_2}\right)_+ \tilde{y}_j$, $j = 1, \ldots, m$, where
$\tilde{y}_{ij} = \mathrm{sign}(y_{ij})(|y_{ij}| - \lambda_2/2)_+$, $i = 1, \ldots, n$ is the
result of component-level soft thresholding of each $y_j$ with a threshold $\lambda_2/2$.
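Both closed forms are easy to evaluate for a single vector. The sketch below is illustrative only (the function names are assumptions, not part of any library): it applies the vector-level "shrink-or-kill" rule of the group lasso, and for the sparse group lasso first soft-thresholds the components at $\lambda_2/2$.

```python
import math

def group_lasso(y, lam):
    """Vector-level shrink-or-kill: (1 - lam / (2 * ||y||_2))_+ * y."""
    norm = math.sqrt(sum(v * v for v in y))
    if norm == 0.0:
        return [0.0] * len(y)
    scale = max(1.0 - lam / (2.0 * norm), 0.0)
    return [scale * v for v in y]

def sparse_group_lasso(y, lam1, lam2):
    """Soft-threshold components at lam2/2, then shrink-or-kill the whole vector."""
    y_soft = [math.copysign(max(abs(v) - lam2 / 2.0, 0.0), v) for v in y]
    return group_lasso(y_soft, lam1)

shrunk = group_lasso([3.0, 4.0], 5.0)       # ||y||_2 = 5, so the scale factor is 0.5
killed = group_lasso([0.3, 0.4], 5.0)       # ||y||_2 < lam/2, so the vector is zeroed
sgl = sparse_group_lasso([3.0, 4.0, 0.5], 5.0, 2.0)
```

Note that every surviving component is shrunk twice in the sparse group lasso: once by the component-level step and once by the vector-level factor.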

To the best of our knowledge, there are no theoretical results on optimality of sparse group lasso
similar to those presented in this paper for the complexity penalized estimators (2). Moreover, we
believe that, generally, $l_0$-type penalties are more "natural" for representing sparsity and the main
reasons for using other types of penalties ($l_1$ in particular) are mostly computational. For a general
regression model, complexity penalties indeed imply combinatorial search over all possible models,
while, for example, the sparse group lasso estimator can still be efficiently computed by numerical
iterative algorithms (see Friedman, Hastie & Tibshirani, 2010 and Simon et al., 2011 for details).
However, for the model (1), which can be essentially viewed as a special case of a general regression
setup, (2) can also be solved by fast algorithms (see Section 2), which makes such computational
arguments irrelevant.

The paper is organized as follows. In Section 2 we develop a Bayesian formalism that gives
rise to penalized estimators (2). The asymptotic (as both $m$ and $n$ increase) adaptive minimaxity
of the resulting sparse group MAP estimators over various sparse and dense settings is investigated
in Section 3. The short simulation study is presented in Section 4 and some concluding remarks
are given in Section 5. All the proofs are placed in the Appendix.



2 Bayesian sparse group MAP estimation 

Consider again the model (1). If we knew the indices of nonzero vectors $\mu_j$ and the locations of their
"significant" entries $\mu_{ij}$, we would evidently estimate them by the corresponding $y_{ij}$ and set others
to zero. Hence, the original problem is essentially reduced to finding an $n \times m$ indicator matrix
$D$, where $d_{ij}$ indicates whether $\mu_{ij}$ is "significant" or not, and can be viewed as a model selection
problem. Note that due to between- and within-sparsity assumptions, the matrix $D$ should be
sparse in the double sense: only part of $D$'s columns $d_j$ are supposed to be nonzero, and even
nonzero columns are sparse.

We first introduce some notation. Let $J_0$ and $J_0^c$ be the sets of indices corresponding
respectively to zero and nonzero mean vectors $\mu_j$'s, and $m_0 = |J_0^c| = \#\{j : \mu_j \neq 0, j = 1, \ldots, m\}$.
Denote by $h_j = \sum_{i=1}^n d_{ij} = \#\{i : \mu_{ij} \neq 0, i = 1, \ldots, n\}$ the number of nonzero components in $\mu_j$,
where evidently $h_j = 0$ for $j \in J_0$.

Consider the following Bayesian model selection procedure for identifying nonzero components
$\mu_{ij}$ or, equivalently, the indicator matrix $D$. To capture the between- and within-sparsity
assumptions we place a hierarchical prior on $D$. We first assume some prior distribution on the
number of nonzero mean vectors $m_0 \sim \pi_0(m_0) > 0$, $m_0 = 0, \ldots, m$. For a given $m_0$, assume
that all $\binom{m}{m_0}$ different configurations of zero and nonzero mean vectors are equally likely, that is,
conditionally on $m_0$,

$$P(J_0^c \mid |J_0^c| = m_0) = \binom{m}{m_0}^{-1}.$$

Obviously, $h_j | \{j \in J_0\} \sim \delta(0)$ and, thus, $d_j | \{j \in J_0\} \sim \delta(0)$ and $\mu_j | \{j \in J_0\} \sim \delta(0)$. For
nonzero $\mu_j$ we place independent priors $\pi_j(\cdot)$ on the number of their nonzero components, that is,
$h_j | \{j \in J_0^c\} \sim \pi_j(h_j) > 0$, $h_j = 1, \ldots, n$. In this case, we again assume that for a given $h_j$, all
$\binom{n}{h_j}$ possible indicator vectors $d_j$ with $h_j$ nonzero components have the same prior probabilities
and, therefore,

$$P(d_j \mid h_j, \, j \in J_0^c) = \binom{n}{h_j}^{-1}.$$

Finally, to complete the prior for (1), we have $\mu_{ij} | \{d_{ij} = 0\} \sim \delta(0)$, while nonzero $\mu_{ij}$ are assumed to
be i.i.d. $N(0, \gamma \sigma_n^2)$, where $\gamma > 0$.

A straightforward Bayesian calculus yields the posterior probability for a given indicator matrix
$D$:

$$P(D|y) \propto \pi_0(m_0) \binom{m}{m_0}^{-1} \prod_{j \in J_0^c} \left\{ \pi_j(h_j) \binom{n}{h_j}^{-1} (1+\gamma)^{-h_j/2} \exp\left( \frac{\gamma}{1+\gamma} \cdot \frac{\sum_{i=1}^n y_{ij}^2 d_{ij}}{2\sigma_n^2} \right) \right\}.$$

Given the posterior distribution $P(D|y)$ we apply the maximum a posteriori (MAP) rule to choose
the most likely configuration of zero and nonzero $\mu_{ij}$ that leads to the following MAP criterion:

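$$\sum_{j \in J_0^c} \left\{ \sum_{i=1}^n y_{ij}^2 d_{ij} + 2\sigma_n^2 (1 + 1/\gamma) \ln\left( \pi_j(h_j) \binom{n}{h_j}^{-1} (1+\gamma)^{-h_j/2} \right) \right\} + 2\sigma_n^2 (1 + 1/\gamma) \ln\left( \pi_0(m_0) \binom{m}{m_0}^{-1} \right) \to \max_D \qquad (4)$$

From (4) it follows immediately that for a given $h_j > 0$ the optimal choice $\hat{d}_j(h_j)$ for $d_j$ is $\hat{d}_{ij}(h_j) = 1$
for the $h_j$ largest $|y_{ij}|$ and zero otherwise. The criterion (4) is then reduced to

$$\sum_{j \in J_0^c} \left\{ \sum_{i=1}^{h_j} y_{(i)j}^2 + 2\sigma_n^2 (1 + 1/\gamma) \ln\left( \pi_j(h_j) \binom{n}{h_j}^{-1} (1+\gamma)^{-h_j/2} \right) \right\} + 2\sigma_n^2 (1 + 1/\gamma) \ln\left( \pi_0(m_0) \binom{m}{m_0}^{-1} \right) \to \max \qquad (5)$$

where $|y|_{(1)j} \geq \ldots \geq |y|_{(n)j}$. For every $j = 1, \ldots, m$ define

$$\hat{h}_j = \arg\min_{1 \leq h_j \leq n} \left\{ \sum_{i=h_j+1}^n y_{(i)j}^2 + 2\sigma_n^2 (1 + 1/\gamma) \ln\left( \pi_j^{-1}(h_j) \binom{n}{h_j} (1+\gamma)^{h_j/2} \right) \right\} \qquad (6)$$

Then, (5) is equivalent to minimizing

$$\sum_{j \in J_0^c} \left\{ -\sum_{i=1}^{\hat{h}_j} y_{(i)j}^2 + 2\sigma_n^2 (1 + 1/\gamma) \ln\left( \pi_j^{-1}(\hat{h}_j) \binom{n}{\hat{h}_j} (1+\gamma)^{\hat{h}_j/2} \right) \right\} + 2\sigma_n^2 (1 + 1/\gamma) \ln\left( \pi_0^{-1}(m_0) \binom{m}{m_0} \right) \qquad (7)$$

over all subsets of indices $J_0^c \subseteq \{1, \ldots, m\}$. Define

$$W_j = -\sum_{i=1}^{\hat{h}_j} y_{(i)j}^2 + 2\sigma_n^2 (1 + 1/\gamma) \ln\left( \pi_j^{-1}(\hat{h}_j) \binom{n}{\hat{h}_j} (1+\gamma)^{\hat{h}_j/2} \right) \qquad (8)$$

Then, (7) is obviously reduced to

$$\min_{0 \leq m_0 \leq m} \left\{ \sum_{j=1}^{m_0} W_{(j)} + 2\sigma_n^2 (1 + 1/\gamma) \ln\left( \pi_0^{-1}(m_0) \binom{m}{m_0} \right) \right\} \qquad (9)$$

where $W_{(1)} \leq \ldots \leq W_{(m)}$ and for $m_0 = 0$ the sum in the RHS of (9) evidently does not appear.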

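Summarizing, a simple and efficient algorithm for finding the proposed sparse group MAP
estimators of $\mu_1, \ldots, \mu_m$ in (1) can be formulated as follows:

Sparse group MAP estimation algorithm

1. For every $j = 1, \ldots, m$, find $\hat{h}_j$ in (6) and calculate the corresponding $W_j$ in (8).

2. Order $W_j$ in ascending order $W_{(1)} \leq \ldots \leq W_{(m)}$ and find

$$\hat{m}_0 = \arg\min_{0 \leq m_0 \leq m} \left\{ \sum_{j=1}^{m_0} W_{(j)} + 2\sigma_n^2 (1 + 1/\gamma) \ln\left( \pi_0^{-1}(m_0) \binom{m}{m_0} \right) \right\}.$$

3. Let $\hat{J}_0^c$ be the set of indices corresponding to the $\hat{m}_0$ smallest $W_j$. Set $\hat{\mu}_j = 0$ for all $j \in \hat{J}_0$,
while for $j \in \hat{J}_0^c$, take the $\hat{h}_j$ largest $|y_{ij}|$ and threshold others, that is,
$\hat{\mu}_{ij} = y_{ij} I\{|y_{ij}| \geq |y|_{(\hat{h}_j)j}\}$, $i = 1, \ldots, n$, $j \in \hat{J}_0^c$, where $|y|_{(1)j} \geq \ldots \geq |y|_{(n)j}$.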
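The three steps above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: the names `sparse_group_map`, `log_binom` and `geometric_log_prior` are hypothetical, the priors are passed as functions returning log prior masses, and the within-vector criterion is written as penalty minus the sum of the $h$ largest squared observations (which has the same minimizer over $h$ as adding the sum of the remaining squares).

```python
import math

def log_binom(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def sparse_group_map(Y, sigma2, gamma, log_pi0, log_pij):
    """Y: list of m data vectors of length n; log_pi0(m0), log_pij(h): log prior masses."""
    m, n = len(Y), len(Y[0])
    c = 2.0 * sigma2 * (1.0 + 1.0 / gamma)
    h_hat, W = [], []
    # Step 1: for each vector choose h_j and record the attained value W_j.
    for y in Y:
        sq = sorted((v * v for v in y), reverse=True)
        best_h, best_w, cum = 1, math.inf, 0.0
        for h in range(1, n + 1):
            cum += sq[h - 1]
            pen = c * (-log_pij(h) + log_binom(n, h) + 0.5 * h * math.log1p(gamma))
            if pen - cum < best_w:
                best_h, best_w = h, pen - cum
        h_hat.append(best_h)
        W.append(best_w)
    # Step 2: sort W_j and choose the number m0 of nonzero vectors.
    order = sorted(range(m), key=lambda j: W[j])
    best_m0, best_val, cum = 0, c * (-log_pi0(0) + log_binom(m, 0)), 0.0
    for m0 in range(1, m + 1):
        cum += W[order[m0 - 1]]
        val = cum + c * (-log_pi0(m0) + log_binom(m, m0))
        if val < best_val:
            best_m0, best_val = m0, val
    # Step 3: keep the h_j largest |y_ij| in the selected vectors, zero everything else.
    est = [[0.0] * n for _ in range(m)]
    for j in order[:best_m0]:
        keep = sorted(range(n), key=lambda i: -abs(Y[j][i]))[:h_hat[j]]
        for i in keep:
            est[j][i] = Y[j][i]
    return est

def geometric_log_prior(q, lo, hi):
    """Log mass of a truncated geometric prior proportional to q**k on k = lo..hi."""
    log_norm = math.log(sum(q ** k for k in range(lo, hi + 1)))
    return lambda k: k * math.log(q) - log_norm

# Toy data: one vector with two strong components, one pure-noise vector.
Y = [[6.0, -5.0, 0.1, 0.2, -0.3, 0.0], [0.2, -0.1, 0.3, 0.0, 0.1, -0.2]]
est = sparse_group_map(Y, sigma2=1.0, gamma=9.0,
                       log_pi0=geometric_log_prior(0.3, 0, 2),
                       log_pij=geometric_log_prior(0.3, 1, 6))
```

The cost is dominated by sorting, roughly $O(mn \log n + m \log m)$ operations, which is why no combinatorial search over indicator matrices is needed for this model.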



The resulting estimation procedure therefore combines vector-wise and component-wise
thresholding. It is easily verified that the resulting estimator is, in fact, the penalized likelihood
estimator (2) with the complexity penalties

$$Pen_j(0) = 0, \quad Pen_j(h_j) = 2\sigma_n^2 (1 + 1/\gamma) \ln\left( \pi_j^{-1}(h_j) \binom{n}{h_j} (1+\gamma)^{h_j/2} \right), \quad h_j = 1, \ldots, n \qquad (10)$$

and

$$Pen_0(m_0) = 2\sigma_n^2 (1 + 1/\gamma) \ln\left( \pi_0^{-1}(m_0) \binom{m}{m_0} \right), \quad m_0 = 0, \ldots, m \qquad (11)$$

The specific types of penalties $Pen_j(\cdot)$'s and $Pen_0(\cdot)$ depend on the choices of priors $\pi_j(\cdot)$'s and
$\pi_0(\cdot)$. For example, binomial priors $m_0 \sim B(m, \xi_0)$ and $h_j \sim B(n, \xi_j)$ yield linear type penalties
$Pen_0(m_0) = 2\sigma_n^2 \lambda_0^2 m_0$ and $Pen_j(h_j) = 2\sigma_n^2 \lambda_j^2 h_j$ respectively, where
$\lambda_0^2 = (1 + 1/\gamma) \ln\{(1 - \xi_0)/\xi_0\}$
and $\lambda_j^2 = (1 + 1/\gamma) \ln\{\sqrt{1 + \gamma}\,(1 - \xi_j)/\xi_j\}$. For such a choice of $\pi_j(\cdot)$, $W_j$ in (8) is essentially
obtained by hard thresholding of $y_j$ with a constant threshold $\sqrt{2}\sigma_n \lambda_j$. In particular,
$\xi_j = \sqrt{\gamma + 1}/(\sqrt{\gamma + 1} + n^{\gamma/(\gamma+1)})$ leads to the universal thresholding of Donoho & Johnstone (1994a)
with $\lambda_j = \sqrt{\ln n}$. The (truncated) geometric priors $\pi_j(h_j) \propto q_j^{h_j}$, $h_j = 1, \ldots, n$ for some $0 < q_j < 1$,
imply the (nonlinear) so-called $2k \ln(n/k)$-type penalties. The optimality of the resulting hard
thresholding estimator with a data-driven threshold for estimating a single normal mean vector has
been shown in Abramovich, Grinshtein & Pensky (2007), Abramovich et al. (2010), Wu & Zhou
(2012).
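The link between the binomial prior and universal thresholding can be checked numerically. The sketch below assumes the per-component penalty constant $\lambda_j^2 = (1 + 1/\gamma) \ln\{\sqrt{1+\gamma}\,(1-\xi_j)/\xi_j\}$ and the choice $\xi_j = \sqrt{\gamma+1}/(\sqrt{\gamma+1} + n^{\gamma/(\gamma+1)})$; the function names and the values of $n$, $\gamma$, $\sigma$ are illustrative.

```python
import math

def xi_universal(n, gamma):
    """Binomial success probability that reproduces universal thresholding."""
    r = math.sqrt(gamma + 1.0)
    return r / (r + n ** (gamma / (gamma + 1.0)))

def lambda_sq(xi, gamma):
    """Linear penalty constant lambda_j^2 implied by a binomial prior B(n, xi)."""
    return (1.0 + 1.0 / gamma) * math.log(math.sqrt(1.0 + gamma) * (1.0 - xi) / xi)

n, gamma, sigma = 100, 9.0, 1.0
xi = xi_universal(n, gamma)
# Hard threshold sqrt(2) * sigma * lambda_j implied by the penalty 2 * sigma^2 * lambda_j^2 * h_j.
threshold = math.sqrt(2.0) * sigma * math.sqrt(lambda_sq(xi, gamma))
universal = sigma * math.sqrt(2.0 * math.log(n))
```

Under these assumptions `threshold` and `universal` agree up to rounding, since $(1-\xi_j)/\xi_j = n^{\gamma/(\gamma+1)}/\sqrt{\gamma+1}$ makes $\lambda_j^2 = \ln n$.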



3 Adaptive minimaxity of sparse group MAP estimators 

In this section we investigate the goodness of the proposed sparse group MAP estimators $\hat{\mu}_1, \ldots, \hat{\mu}_m$
with the penalties (10)-(11), where the goodness-of-fit is measured by the global quadratic risk
$\sum_{j=1}^m E\|\hat{\mu}_j - \mu_j\|_2^2$. We establish their asymptotic minimaxity over a wide range of sparse and
dense settings. To derive these results we need the following assumption on the priors $\pi_j(\cdot)$:

Assumption (P). Assume that

$$\pi_j(h) \leq \binom{n}{h} e^{-c(\gamma) h}, \quad h = 1, \ldots, n, \quad j = 1, \ldots, m, \qquad (12)$$

where $c(\gamma) = 8(\gamma + 3/4)^2 > 9/2$.

Assumption (P) is, in fact, not restrictive. Indeed, the obvious inequality $\binom{n}{h} \geq (n/h)^h$ implies
that for any $\pi_j(\cdot)$, (12) holds for all $h \leq n e^{-c(\gamma)}$. In particular, Assumption (P) is satisfied for
binomial priors $B(n, \xi_j)$ with $\xi_j \leq e^{-c(\gamma)}/(1 + e^{-c(\gamma)})$ and (truncated) geometric priors.
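The binomial case can be verified in one line. The following short derivation is a sketch of that step, using only the binomial probability mass function and the bound in Assumption (P):

```latex
\pi_j(h) = \binom{n}{h}\xi_j^{h}(1-\xi_j)^{n-h}
         = \binom{n}{h}\left(\frac{\xi_j}{1-\xi_j}\right)^{h}(1-\xi_j)^{n}
         \le \binom{n}{h}\left(\frac{\xi_j}{1-\xi_j}\right)^{h}
         \le \binom{n}{h} e^{-c(\gamma) h}
\quad \text{whenever} \quad
\frac{\xi_j}{1-\xi_j} \le e^{-c(\gamma)}
\iff \xi_j \le \frac{e^{-c(\gamma)}}{1+e^{-c(\gamma)}} .
```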

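First, we obtain a general upper bound for the quadratic risk of the sparse group MAP estimator
that will be the key for deriving its asymptotic minimaxity.

Theorem 1 (general upper bound). Consider the sparse group MAP estimators $\hat{\mu}_1, \ldots, \hat{\mu}_m$
of $\mu_1, \ldots, \mu_m$ with the complexity penalties (10)-(11) in the model (1). Under Assumption (P) we
have

$$\sum_{j=1}^m E\|\hat{\mu}_j - \mu_j\|_2^2 \leq c_1(\gamma) \min_{J_0 \subseteq \{1, \ldots, m\}} \left\{ \sum_{j \in J_0^c} \min_{1 \leq h_j \leq n} \left\{ \sum_{i=h_j+1}^n \mu_{(i)j}^2 + Pen_j(h_j) \right\} + \sum_{j \in J_0} \sum_{i=1}^n \mu_{ij}^2 + Pen_0(|J_0^c|) \right\} + c_2(\gamma) \sigma_n^2 (1 - \pi_0(0)), \qquad (13)$$

where $|\mu|_{(1)j} \geq \ldots \geq |\mu|_{(n)j}$ and $c_1(\gamma)$, $c_2(\gamma)$ depend only on $\gamma$.

The results of Theorem 1 hold for any normal mean vectors $\mu_1, \ldots, \mu_m$. Now we consider (1)
under the extra within- and between-sparsity assumptions that will be defined more rigorously
below.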

The between-sparsity is naturally measured by the number $m_0$ of nonzero $\mu_j$'s. The within-sparsity
can be introduced in several ways. The most intuitive measure of within-sparsity of a single
normal mean vector $\mu \in \mathbb{R}^n$ is the number of its nonzero components, that is, its $l_0$ quasi-norm
$\|\mu\|_0$. Define then an $l_0$-ball $l_0[\eta]$ of standardized radius $\eta$ as a set of $\mu$ with at most a proportion
$\eta$ of nonzero entries, that is

$$l_0[\eta] = \{\mu \in \mathbb{R}^n : \|\mu\|_0 \leq \eta n\}$$

One can argue that in many practical settings, it is more reasonable to assume that the components
$\mu_i$'s of $\mu$ are not exactly zero but "small". In a wider sense the within-sparsity of $\mu$ can be
then defined by the proportion of its large entries. Formally, define a weak $l_p$-ball $m_p[\eta]$ with a
standardized radius $\eta$ as

$$m_p[\eta] = \{\mu \in \mathbb{R}^n : |\mu|_{(i)} \leq \sigma_n \eta (n/i)^{1/p}, \; i = 1, \ldots, n\},$$

where $|\mu|_{(1)} \geq \ldots \geq |\mu|_{(n)}$ are the ordered components of $\mu$. For $\mu \in m_p[\eta]$, the proportion of $|\mu_i|$'s
larger than $\sigma_n \delta$ for some $\delta > 0$ is at most $(\eta/\delta)^p$.

Within-sparsity can be also measured in terms of the $l_p$-norm of $\mu$, where a strong $l_p$-ball $l_p[\eta]$
with standardized radius $\eta$ is defined as

$$l_p[\eta] = \left\{\mu \in \mathbb{R}^n : \frac{1}{n} \sum_{i=1}^n |\mu_i|^p \leq \sigma_n^p \eta^p \right\}$$

There are well-known relationships between these types of balls. The $l_p$-norm approaches $l_0$ as $p$
decreases, while a weak $l_p$-ball contains the corresponding strong $l_p$-ball but only just:

$$l_p[\eta] \subset m_p[\eta] \not\subset l_{p'}[\eta], \quad p' > p$$
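The three notions of within-sparsity translate directly into membership checks. The sketch below is illustrative (the function names and the example values are assumptions); it also shows a vector lying on the boundary of a weak $l_p$-ball that falls outside the strong $l_p$-ball of the same radius.

```python
def in_l0_ball(mu, eta):
    """At most a proportion eta of the entries are nonzero."""
    return sum(1 for v in mu if v != 0.0) <= eta * len(mu)

def in_weak_lp_ball(mu, eta, p, sigma=1.0):
    """The i-th largest |mu_i| is at most sigma * eta * (n / i)**(1/p) for every i."""
    n = len(mu)
    ordered = sorted((abs(v) for v in mu), reverse=True)
    return all(a <= sigma * eta * (n / i) ** (1.0 / p) * (1.0 + 1e-12)
               for i, a in enumerate(ordered, start=1))

def in_strong_lp_ball(mu, eta, p, sigma=1.0):
    """The average of |mu_i|**p is at most (sigma * eta)**p."""
    return sum(abs(v) ** p for v in mu) / len(mu) <= (sigma * eta) ** p

n, eta, p = 50, 0.2, 1.0
# Extremal weak-ball vector: every ordered component sits exactly on its bound.
extremal = [eta * (n / i) ** (1.0 / p) for i in range(1, n + 1)]
flat = [0.1] * n
```

For `extremal` the average of $|\mu_i|^p$ equals $\eta^p$ times a harmonic sum, which exceeds $\eta^p$ only by a logarithmic factor; this is the sense in which the weak ball contains the strong one "but only just".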

We recall first the known results on minimax rates for estimating a single normal mean vector
$\mu$ over different types of balls introduced above. Let $\Theta[\eta_n] \subset \mathbb{R}^n$ be any of $l_0[\eta_n]$, $l_p[\eta_n]$ or $m_p[\eta_n]$,
where the standardized radius $\eta_n$ might depend on $n$. The corresponding minimax quadratic risk
for estimating a single $\mu$ ($m = 1$) over $\Theta[\eta_n]$ in (1) is
$R(\Theta[\eta_n]) = \inf_{\tilde{\mu}} \sup_{\mu \in \Theta[\eta_n]} E\|\tilde{\mu} - \mu\|_2^2$,
where the infimum is taken over all estimates $\tilde{\mu}$ of $\mu$. For $p > 0$ define $\eta_{0n} = n^{-1/\min(p,2)} \sqrt{\ln n}$.

Depending on the behaviour of $\eta_n$ as $n$ increases, we distinguish between three cases for $p > 0$ and
two cases for $p = 0$:

a) dense, where $\eta_n \not\to 0$

b) sparse, where $\eta_n \to 0$ but $\eta_n/\eta_{0n} \not\to 0$ for $p > 0$ and, obviously, $\eta_n \geq n^{-1}$ for $p = 0$

c) super-sparse (for $p > 0$), where $\eta_n/\eta_{0n} \to 0$

The corresponding minimax convergence rates $R(\Theta[\eta_n])$ for various cases and $p$ are summarized
in Table 1 below (see Donoho et al., 1992; Johnstone, 1994; Donoho & Johnstone, 1994b).

The rates for $m_p[\eta_n]$ are the same as for $l_p[\eta_n]$ except $p = 2$, where there is an additional
log-term. Table 1 defines dense and sparse zones for $p = 0$ and $p \geq 2$, and dense, sparse and
super-sparse zones for $0 < p < 2$ of different minimax rates.



Case                 $p = 0$                                  $0 < p < 2$                                            $p \geq 2$
dense case           $\sigma_n^2 n$                           $\sigma_n^2 n$                                         $\sigma_n^2 n$
sparse case          $\sigma_n^2 n \eta_n \ln \eta_n^{-1}$    $\sigma_n^2 n \eta_n^p (\ln \eta_n^{-p})^{1-p/2}$      $\sigma_n^2 n \eta_n^2$
super-sparse case    -                                        $\sigma_n^2 n^{2/p} \eta_n^2$                          $\sigma_n^2 n \eta_n^2$

Table 1: Minimax rates (up to multiplying constants) over various $l_0[\eta_n]$, $l_p[\eta_n]$ and $m_p[\eta_n]$-balls.
The rates are the same for $l_p[\eta_n]$ and $m_p[\eta_n]$ except $p = 2$, where for $m_p[\eta_n]$ there appears the
additional log-term which is not presented in Table 1 for brevity.

Consider now the model (1) for $m > 1$. Recall that $m_0 = \#\{j : \mu_j \neq 0\}$ and $J_0^c$ is the set of
indices for nonzero $\mu_j$. In what follows we assume that $\mu_j \in \Theta_j[\eta_{jn}]$ for $j \in J_0^c$, where the types
($l_0$, weak $m_p$ or strong $l_p$) and the parameters $p$ of the corresponding balls are not necessarily the
same for all $j$. Furthermore, we allow the priors $\pi_0(\cdot)$ and $\pi_j(\cdot)$ to depend respectively on $m$ and
$n$.

Theorem 2 below defines the asymptotic upper bounds for the quadratic risks of the sparse
group MAP estimator in (1) under within- and between-sparsity assumptions:

Theorem 2 (upper bounds over sparse and dense settings). Consider the model (1), where $J_0^c \neq \emptyset$
(not pure noise). Assume that $\mu_j \in \Theta_j[\eta_{jn}]$ for all $j \in J_0^c$, where $\eta_{jn} \geq n^{-1/\min(p_j,2)} \sqrt{\ln n}$ for all
$p_j > 0$ (excluding, thus, super-sparse cases).

Let $\hat{\mu}_1, \ldots, \hat{\mu}_m$ be the sparse group MAP estimators with the complexity penalties (10)-(11),
where assume that there exist constants $c_0, c_1 > 0$ and $c_2 \geq c(\gamma)$ such that

1. $\pi_0(k) \geq (k/m)^{c_0 k}$, $k = 1, \ldots, \lfloor m/e \rfloor$ and $\pi_0(m) \geq e^{-c_0 m}$

2. for all $j = 1, \ldots, m$, $\pi_j(\cdot)$ satisfy Assumption (P) and, in addition, $\pi_j(h) \geq (h/n)^{c_1 h}$,
$h = 1, \ldots, \lfloor n e^{-c(\gamma)} \rfloor$; $\pi_j(n) \geq e^{-c_2 n}$

Then, for any $J_0^c \subseteq \{1, \ldots, m\}$ with $|J_0^c| = m_0$ and all $\Theta_j[\eta_{jn}]$, $j \in J_0^c$,

$$\sup_{\mu_j \in \Theta_j[\eta_{jn}], \, j \in J_0^c} \sum_{j=1}^m E\|\hat{\mu}_j - \mu_j\|_2^2 \leq C_1(\gamma) \max\left( \sum_{j \in J_0^c} R(\Theta_j[\eta_{jn}]), \; \sigma_n^2 m_0 \ln(m/m_0) \right) \qquad (14)$$

for some constant $C_1(\gamma)$ depending only on $\gamma$, where the corresponding $R(\Theta_j[\eta_{jn}])$ are given in Table
1 (up to multiplying constants).

Theorem 2 shows that as both $m$ and $n$ increase, the asymptotic convergence rates in (14) are
either of order $\sum_{j \in J_0^c} R(\Theta_j[\eta_{jn}])$ or $\sigma_n^2 m_0 \ln(m/m_0)$. The former is associated with the optimal
rates of estimating $m_0$ single sparse vectors in $\Theta_j[\eta_{jn}]$, $j \in J_0^c$, while the latter appears in the
optimal rates in model selection and corresponds to the error of selecting a subset of $m_0$
nonzero elements out of $m$ (see, e.g., Abramovich & Grinshtein, 2010; Raskutti, Wainwright & Yu,
2011; Rigollet & Tsybakov, 2011). From Table 1 it follows that for all within-dense and within-sparse
cases, $C_1 \sigma_n^2 \ln n \leq R(\Theta_j[\eta_{jn}]) \leq C_2 \sigma_n^2 n$, $j \in J_0^c$ for some $C_1, C_2 > 0$ and, therefore, the
first term $\sum_{j \in J_0^c} R(\Theta_j[\eta_{jn}])$ in the upper bound (14) is always dominating for $m_0 \geq m/n$, while the
second term $\sigma_n^2 m_0 \ln(m/m_0)$ is necessarily the main one for $m_0 \leq m/e^n$.
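The two regimes follow by summing the single-vector bounds over the $m_0$ nonzero vectors. A short worked version of this step, written as a sketch with the same generic constants $C_1$, $C_2$:

```latex
\begin{aligned}
\sum_{j \in J_0^c} R(\Theta_j[\eta_{jn}]) &\ge C_1 \sigma_n^2 m_0 \ln n \ge C_1 \sigma_n^2 m_0 \ln(m/m_0)
  && \text{whenever } m/m_0 \le n, \text{ i.e. } m_0 \ge m/n, \\
\sum_{j \in J_0^c} R(\Theta_j[\eta_{jn}]) &\le C_2 \sigma_n^2 m_0 n \le C_2 \sigma_n^2 m_0 \ln(m/m_0)
  && \text{whenever } \ln(m/m_0) \ge n, \text{ i.e. } m_0 \le m e^{-n}.
\end{aligned}
```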

One can easily verify that the conditions on the priors $\pi_0(\cdot)$ and $\pi_j(\cdot)$ required in Theorem
2 are satisfied, for example, for the (truncated) geometric priors (see Section 2). On the other
hand, no binomial priors $\pi_0 = B(m, \xi_0)$ or $\pi_j = B(n, \xi_j)$ can satisfy all of them: the requirement
$\pi_j(n) = \xi_j^n \geq e^{-c_2 n}$ yields $\xi_j \geq e^{-c_2}$, while to have $\pi_j(1) = n \xi_j (1 - \xi_j)^{n-1} \geq n^{-c_1}$ one needs $\xi_j \to 0$
as $n$ increases.

To establish the corresponding lower bound for the minimax risk, for simplicity of exposition we
consider only the two cases, where $p_j$ for $j \in J_0^c$ are either all zeroes or all positive. In fact, these
are the two main scenarios appearing in various setups. Somewhat similar results for minimax
lower bounds in the particular context of sparse nonparametric additive models (see Introduction)
appear in Raskutti, Wainwright and Yu (2012).

Theorem 3 (minimax lower bounds for $l_0$-balls). Consider the model (1), where $\mu_j \in l_0[\eta_{jn}]$, $j \in J_0^c$.
Assume that $|J_0^c| = m_0 > 0$. Then, there exists a constant $C_2 > 0$ such that

$$\inf_{\tilde{\mu}_1, \ldots, \tilde{\mu}_m} \; \sup_{\mu_j \in l_0[\eta_{jn}], \, j \in J_0^c} \sum_{j=1}^m E\|\tilde{\mu}_j - \mu_j\|_2^2 \geq C_2 \max\left( \sum_{j \in J_0^c} R(l_0[\eta_{jn}]), \; \sigma_n^2 m_0 \ln(m/m_0) \right), \qquad (15)$$

where the infimum is taken over all estimates $\tilde{\mu}_1, \ldots, \tilde{\mu}_m$ of $\mu_1, \ldots, \mu_m$.

Theorem 3 shows that, as $m$ and $n$ increase, the rates in (14) cannot be improved for $l_0$-balls.
The proposed sparse group MAP estimator in this case is, therefore, adaptive to the unknown
degrees of within- and between-sparsity and is simultaneously rate-optimal (in the minimax sense)
over the entire range of dense and sparse $l_0$-balls settings.

The analysis of the case $p_j > 0$ is slightly more delicate. Note first that due to the embedding
properties of $l_p$-balls for $p > 0$ (see above), it is sufficient to establish the minimax lower bounds
for strong $l_p$-balls settings.

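Theorem 4 (minimax lower bounds for $l_p$-balls). Consider the model (1), where $\mu_j \in l_{p_j}[\eta_{jn}]$, $j \in J_0^c$
and $|J_0^c| = m_0 > 0$. In addition, assume that $\eta_{jn}^2 \geq n^{-2/\min(p_j,2)} \max(\ln n, \ln(m/m_0))$. Under
this additional constraint, there exists a constant $C_2 > 0$ such that

$$\inf_{\tilde{\mu}_1, \ldots, \tilde{\mu}_m} \; \sup_{\mu_j \in l_{p_j}[\eta_{jn}], \, j \in J_0^c} \sum_{j=1}^m E\|\tilde{\mu}_j - \mu_j\|_2^2 \geq C_2 \max\left( \sum_{j \in J_0^c} R(l_{p_j}[\eta_{jn}]), \; \sigma_n^2 m_0 \ln(m/m_0) \right), \qquad (16)$$

where the infimum is taken over all estimates $\tilde{\mu}_1, \ldots, \tilde{\mu}_m$ of $\mu_1, \ldots, \mu_m$.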

Similar to Theorem 3, Theorem 4 implies simultaneous optimality (in the minimax sense) of the
MAP sparse group estimator over strong and weak $l_p$-balls but with the restriction on $\eta_{jn}$ and $m_0$.
In particular, it does not cover settings with within-super-sparsity but might also exclude part of
the corresponding within-sparse zone (depending on $m_0$). Within- and between-sparsity cannot both be
"too strong". In fact, the condition $\eta_{jn}^2 < n^{-2/\min(p_j,2)} \max(\ln n, \ln(m/m_0))$, $j \in J_0^c$ can be
viewed as an extended definition of super-sparsity for $m > 1$. For such a super-sparse case, the
minimax bound (16) does not hold and can be reduced. Indeed, consider the trivial zero estimator
$\tilde{\mu}_j = 0$, $j = 1, \ldots, m$, where, evidently,

$$\sup_{\mu_j \in l_{p_j}[\eta_{jn}], \, j \in J_0^c} \sum_{j=1}^m E\|\tilde{\mu}_j - \mu_j\|_2^2 = \sum_{j \in J_0^c} \sup_{\mu_j \in l_{p_j}[\eta_{jn}]} \|\mu_j\|_2^2 \qquad (17)$$

The least favourable sequences that maximize $\|\mu_j\|_2^2$ over $l_{p_j}[\eta_{jn}]$ are $(\sigma_n \eta_{jn}, \ldots, \sigma_n \eta_{jn})'$
and $(\sigma_n \eta_{jn} n^{1/p_j}, 0, \ldots, 0)'$ for $p_j \geq 2$ and $0 < p_j < 2$ respectively. Thus,
$\sup_{\mu_j \in l_{p_j}[\eta_{jn}]} \|\mu_j\|_2^2 = \sigma_n^2 \eta_{jn}^2 n^{2/\min(p_j,2)}$ and the RHS of (17) is less than $\sigma_n^2 m_0 \ln(m/m_0)$ for
$\eta_{jn}^2 < n^{-2/\min(p_j,2)} \ln(m/m_0)$, $j \in J_0^c$. This is in line with the corresponding results for
estimating a single normal mean vector, where a zero estimator is known to be rate-optimal for the
super-sparse case (Donoho & Johnstone, 1994b).
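The risk of the zero estimator over a strong $l_p$-ball can be illustrated numerically. This is a minimal sketch with hypothetical function names and example values; it builds the two norm-maximizing vectors (a flat vector for $p \geq 2$, a single spike for $0 < p < 2$) and evaluates their squared norms.

```python
def least_favourable(n, eta, p, sigma=1.0):
    """Norm-maximizing vector in the strong l_p-ball of standardized radius eta."""
    if p >= 2:
        return [sigma * eta] * n                                  # mass spread evenly
    return [sigma * eta * n ** (1.0 / p)] + [0.0] * (n - 1)       # single spike

def sq_norm(mu):
    """Squared Euclidean norm, i.e. the risk of the zero estimator at mu."""
    return sum(v * v for v in mu)

n, eta = 100, 0.05
dense = sq_norm(least_favourable(n, eta, p=3.0))   # about eta^2 * n
spike = sq_norm(least_favourable(n, eta, p=1.0))   # about eta^2 * n^2
```

Both values match $\sigma_n^2 \eta^2 n^{2/\min(p,2)}$ with $\sigma_n = 1$, so the zero estimator beats the model selection term $\sigma_n^2 m_0 \ln(m/m_0)$ once $\eta^2 n^{2/\min(p,2)} < \ln(m/m_0)$.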
4 Simulation study



A short simulation study was carried out to demonstrate the performance of the proposed approach. 

The data was generated according to the model (1) with $m = 10$ vectors $\mu_j$'s of length $n = 100$.
Five $\mu_j$'s were identically zeroes, while the other five had respectively 100, 70, 50, 20 and 5 nonzero
components randomly sampled from $N(0, \tau^2)$, $\tau = 1, 3, 5$ and zero others. Such a setup covers
various types of within-sparsity. Finally, the independent standard Gaussian noise $N(0, 1)$ was
added to all components of each $\mu_j$.
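The data-generating step described above can be sketched as follows. This is not the authors' simulation code: the function name `simulate`, the seed, and the uniformly random placement of the nonzero components are assumptions made for illustration.

```python
import random

def simulate(tau, n=100, nonzero_counts=(100, 70, 50, 20, 5),
             n_zero_vectors=5, sigma=1.0, seed=0):
    """Return (means, data): true mean vectors and their noisy observations."""
    rng = random.Random(seed)
    means = []
    for k in nonzero_counts:
        support = set(rng.sample(range(n), k))     # positions of the nonzero components
        means.append([rng.gauss(0.0, tau) if i in support else 0.0 for i in range(n)])
    means += [[0.0] * n for _ in range(n_zero_vectors)]   # identically zero vectors
    data = [[mu + rng.gauss(0.0, sigma) for mu in vec] for vec in means]
    return means, data

means, data = simulate(tau=3.0)
counts = [sum(1 for v in vec if v != 0.0) for vec in means]
```

With `sigma=1.0` the ratio $\gamma = \tau^2/\sigma^2$ takes the values 1, 9 and 25 for $\tau = 1, 3, 5$.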

We tried binomial and truncated geometric priors for sparse group MAP estimators. For the
binomial prior, we performed component-wise universal hard thresholding of Donoho & Johnstone
(1994a) with a threshold $\lambda = \sigma\sqrt{2 \log n}$ within each vector that essentially corresponds to
$\xi_j = \sqrt{\gamma + 1}/(\sqrt{\gamma + 1} + n^{\gamma/(\gamma+1)})$, where $\gamma = \tau^2/\sigma^2$ (see Section 2), and used $\xi_0 = 1/m$. For
the geometric prior we set $q_0 = q_j = 0.3$. In addition, we compared the performances of sparse
group MAP estimators with the sparse group lasso estimator (3) of Friedman, Hastie & Tibshirani
(2010) described in the Introduction. They do not discuss the optimal choices for $\lambda_1$ and $\lambda_2$ in (3).
Some heuristic arguments are given in Simon et al. (2011). In our simulation study we considered
instead two oracle-based choices for these tuning parameters, thus giving a significant head start to
sparse group lasso estimators. Since in simulation examples the true mean vectors $\mu_j$ are known,
they can be used for optimally choosing $\lambda_1$ and $\lambda_2$. In particular, we considered a "semi-oracle"
sparse group lasso estimator, where we set $\lambda_2 = 2\sigma\sqrt{2 \log n}$ yielding universal soft thresholding
within each vector (see Introduction) to compare the sparse group lasso with the binomial sparse
group MAP. $\lambda_1$ was chosen by minimizing the mean squared error $\sum_{j=1}^m E\|\hat{\mu}_j(\lambda_1) - \mu_j\|_2^2$ estimated
by averaging over a series of 1000 replications for each value of $\lambda_1$ by a grid search. In addition, we
applied a "fully oracle" sparse group lasso estimator, where both $\lambda_1$ and $\lambda_2$ were chosen to minimize
the mean squared error by a two-dimensional grid search. It can be considered as a benchmark for the
performance of sparse group lasso. Table 2 provides the resulting oracle choices for $\lambda_1$ and $\lambda_2$.



$\gamma$    $\lambda_1$    $\lambda_2$
1           11.8           0.9
9           7.2            1.1
25          4.7            1.3

Table 2: The oracle choices for the parameters of the fully oracle sparse group lasso estimator
($\gamma = \tau^2/\sigma^2$).

Table 2 shows that for all $\gamma$, the oracle choice for $\lambda_2$ in the sparse group lasso is much less than
the conservative universal threshold $2\sigma\sqrt{2 \log n} \approx 6.06$. The oracle thresholding within each vector
is thus much less severe and keeps more coefficients. The oracle choices for $\lambda_1$ were also quite small
and, as a result, for any $\gamma$, no single vector was thresholded by a fully oracle sparse group lasso,
that is, all $\hat{\mu}_j \neq 0$. Thus it was really a non-sparse estimator for the considered setup.

In Table 3 we present the mean squared errors averaged over 1000 replications for the four
sparse group estimators with the corresponding standard errors for various $\gamma$ (or, equivalently, $\tau$).



$\gamma$    Sparse Group MAP    Sparse Group MAP    Sparse Group Lasso    Sparse Group Lasso
            (binomial)          (geometric)         (semi-oracle)         (fully oracle)
1           247.40 (0.71)       245.46 (0.70)       236.85 (0.65)         161.89 (0.43)
9           608.02 (1.96)       378.87 (1.20)       1120.99 (2.29)        403.76 (0.91)
25          549.77 (1.68)       351.52 (1.30)       1595.91 (2.79)        475.47 (1.07)

Table 3: MSEs averaged over 1000 replications for four sparse group estimators and the
corresponding standard errors (in brackets) for various $\gamma$.



For small $\gamma$ only a few of the largest nonzero components can be distinguished from the noise, which
essentially corresponds to a sparse setting and explains the good performance of binomial sparse group
MAP and semi-oracle sparse group lasso estimators based on universal (respectively, hard and soft)
thresholding within each vector. For larger $\gamma$, universal thresholding becomes "over-conservative". The
negative effect of its conservativeness is much stronger for soft than for hard thresholding (see comments
below). The fully oracle sparse group lasso estimator strongly outperforms its semi-oracle counterpart,
especially for $\gamma = 9$ ($\tau = 3$) and $\gamma = 25$ ($\tau = 5$), also indicating that the universal thresholding
is far from being optimal for sparse group lasso, especially for moderate and large $\gamma$ (see also our
previous comments on the optimal choice of $\lambda_2$).

On the other hand, geometric sparse group MAP estimator corresponding to a nonlinear 
2k ln(n/A;)-type penalty (see Section[2|) provides good results for all 7 nicely following the theoretical 
results of Section [3l Moreover, for 7 = 9 and 7 = 25, it outperforms even the fully oracle sparse 
group lasso estimator that was essentially thought as a benchmark rather than a fair competitor. 
This indicates that that sparse group lasso faces general problems. In fact, it may be not so 
surprising since soft "shrink-or-kill" thresholding inherent for sparse group lasso is well-known to 
be superior to hard "keep-or-kill" thresholding in sparse group MAP estimation for small coefficients 
but worse for large ones due to the additional shrinkage. Moreover, sparse group lasso essentially 
involves a double amount of shrinkage - both within vectors and at each entire vector as a whole (see 
([3])). It thus causes unnecessary extra bias growing with 7 that outweighs the benefits of variance 
reduction. Similar phenomenon appears also for naive elastic set estimation (Zou & Hastie, 2005). 
5 Concluding remarks 



In this paper we considered estimation of a sparse group of sparse normal mean vectors. The 
proposed approach is based on penalized likelihood estimation with complexity penalties on 
both between- and within-sparsity and can be performed by a computationally fast algorithm. 
The resulting estimators naturally arise within a Bayesian framework and can be viewed as MAP 
estimators corresponding to the priors on the number of nonzero mean vectors and the numbers 
of their nonzero components. Such a Bayesian perspective provides a natural tool for obtaining a 
wide class of penalized likelihood estimators with various complexity penalties. 

We established the adaptive minimaxity of sparse group MAP estimators to the unknown degree 
of between- and within-sparsity over a wide range of sparse and dense settings. The short simulation 
study demonstrates the efficiency of the proposed approach that outperforms the recently presented 
sparse group lasso estimator. 

Acknowledgments. Both authors were supported by the Israel Science Foundation grant 
ISF-248/08. We are grateful to Ofir Harari for his assistance in running simulation examples and 
Saharon Rosset for fruitful discussions. 

Appendix 

Throughout the proofs we use $C$ to denote a generic positive constant, not necessarily the same 
each time it is used, even within a single equation. Similarly, $C(\gamma)$ is a generic positive 
constant depending on $\gamma$. 

Proof of Theorem 1 

As we have mentioned in Section 2, the sparse group MAP estimator can be viewed as a penalized 
likelihood estimator (2) with the complexity penalties (10) and (11). We first rewrite it in a 
somewhat different form that will then allow us to apply the general results of Birgé & Massart 
(2001) for complexity penalized estimators. 

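Let $y = (y_{11}, \ldots, y_{n1}, \ldots, y_{1m}, \ldots, y_{nm})'$ be an amalgamated $n \times m$ 
vector of data. Similarly, $\mu = (\mu_{11}, \ldots, \mu_{nm})'$, 
$\epsilon = (\epsilon_{11}, \ldots, \epsilon_{n1}, \ldots, \epsilon_{1m}, \ldots, \epsilon_{nm})'$, 
and the original model (1) can now be rewritten as 

$$y_i = \mu_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma_n^2), \quad i = 1, \ldots, nm \qquad (18)$$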

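Define an indicator vector $d$, where $d_i = I\{\mu_i \neq 0\}$, $i = 1, \ldots, nm$. In terms of 
the model (1), $h_j = \sum_{i=n(j-1)+1}^{nj} d_i$, $j = 1, \ldots, m$, and 
$m_0 = \#\{j : h_j > 0\}$. For a given $d$, define 
$D_d = \sum_{j=1}^m h_j = \#\{i : d_i = 1,\ i = 1, \ldots, nm\}$ and 

$$L_d = \frac{1}{D_d}\left\{\sum_{j=1}^m \ln\left(\pi_j^{-1}(h_j)\binom{n}{h_j}\right) + \ln\left(\pi_0^{-1}(m_0)\binom{m}{m_0}\right)\right\}$$

for $d \neq 0$ and $L_0 = 2\ln \pi_0^{-1}(0)$, where we formally set $\pi_j(0) = 1$. Then, the sparse 
group MAP estimator 
$\hat\mu = (\hat\mu_{11}, \ldots, \hat\mu_{n1}, \ldots, \hat\mu_{1m}, \ldots, \hat\mu_{nm})'$ is the 
penalized likelihood estimator of $\mu$ with the complexity penalty 

$$\mathrm{Pen}(d) = 2\sigma_n^2(1 + 1/\gamma)\left\{\sum_{j=1}^m \ln\left(\pi_j^{-1}(h_j)\binom{n}{h_j}(1+\gamma)^{h_j/2}\right) + \ln\left(\pi_0^{-1}(m_0)\binom{m}{m_0}\right)\right\}$$

$$= \sigma_n^2(1 + 1/\gamma)\,D_d\left(2L_d + \ln(1+\gamma)\right)$$

for $d \neq 0$ and $\mathrm{Pen}(0) = \sigma_n^2(1 + 1/\gamma)L_0$. 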
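
As a rough illustration of how such a complexity penalty is evaluated for a given indicator vector, the sketch below computes $h_j$, $m_0$ and the total penalty as a sum of within-vector terms and one between-vector term. The function names, the uniform priors and the toy dimensions are assumptions made only for this example; they are not the priors studied in the paper.

```python
import math

def pen_within(h, n, sigma2, gamma, pi_j):
    """Within-vector term 2*sigma^2*(1+1/gamma)*ln(C(n,h)*(1+gamma)^(h/2)/pi_j(h)).

    With the formal convention pi_j(0) = 1, an all-zero vector (h = 0) costs nothing.
    """
    if h == 0:
        return 0.0
    log_term = math.log(math.comb(n, h)) + 0.5 * h * math.log1p(gamma) - math.log(pi_j(h))
    return 2.0 * sigma2 * (1.0 + 1.0 / gamma) * log_term

def pen_between(m0, m, sigma2, gamma, pi_0):
    """Between-vector term 2*sigma^2*(1+1/gamma)*ln(C(m,m0)/pi_0(m0))."""
    log_term = math.log(math.comb(m, m0)) - math.log(pi_0(m0))
    return 2.0 * sigma2 * (1.0 + 1.0 / gamma) * log_term

def total_penalty(d, n, m, sigma2, gamma, pi_j, pi_0):
    """Return (Pen(d), h, m0) for a flat 0/1 vector d of length n*m (vector j occupies block j)."""
    h = [sum(d[j * n:(j + 1) * n]) for j in range(m)]
    m0 = sum(1 for hj in h if hj > 0)
    pen = sum(pen_within(hj, n, sigma2, gamma, pi_j) for hj in h)
    return pen + pen_between(m0, m, sigma2, gamma, pi_0), h, m0

# Hypothetical toy setting: uniform priors on {1,...,n} and on {0,...,m}.
n, m, sigma2, gamma = 4, 3, 1.0, 1.0
uniform_pi_j = lambda h: 1.0 / n
uniform_pi_0 = lambda k: 1.0 / (m + 1)

d = [1, 0, 0, 0,  0, 0, 0, 0,  1, 1, 0, 0]
pen, h, m0 = total_penalty(d, n, m, sigma2, gamma, uniform_pi_j, uniform_pi_0)
print(h, m0, round(pen, 3))
```

Passing the priors as plain functions keeps the sketch agnostic about their form, so other choices of $\pi_j(\cdot)$ and $\pi_0(\cdot)$ can be substituted without changing the penalty code.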
One can verify that 

$$\sum_{d \neq 0} e^{-D_d L_d} = \sum_{k=1}^m \pi_0(k) = 1 - \pi_0(0)$$
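
The identity can be checked by brute force for small $n$ and $m$. The sketch below assumes that $e^{-D_d L_d}$ equals $\pi_0(m_0)\binom{m}{m_0}^{-1}\prod_{j: h_j > 0}\pi_j(h_j)\binom{n}{h_j}^{-1}$ and that each $\pi_j(\cdot)$ is a probability distribution on $\{1, \ldots, n\}$; the prior values and the function name are arbitrary choices made for the illustration.

```python
import itertools
import math

def weight(d, n, m, pi_j, pi_0):
    """Weight exp(-D_d * L_d) of a nonzero indicator vector d under the assumed form of L_d."""
    h = [sum(d[j * n:(j + 1) * n]) for j in range(m)]
    m0 = sum(1 for hj in h if hj > 0)
    w = pi_0[m0] / math.comb(m, m0)
    for hj in h:
        if hj > 0:  # zero vectors contribute a factor of one
            w *= pi_j[hj] / math.comb(n, hj)
    return w

# Hypothetical priors: pi_j sums to one over {1,...,n}, pi_0 over {0,...,m}.
n, m = 3, 2
pi_j = {1: 0.5, 2: 0.3, 3: 0.2}
pi_0 = {0: 0.4, 1: 0.35, 2: 0.25}

total = sum(weight(d, n, m, pi_j, pi_0)
            for d in itertools.product([0, 1], repeat=n * m) if any(d))
print(total, 1.0 - pi_0[0])
```

The enumeration visits all $2^{nm} - 1$ nonzero indicator vectors, so it is only practical for very small dimensions.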

A straightforward calculus (see the proof of Theorem 1 of Abramovich, Grinshtein & Pensky, 2007 
for more details) also implies that for any $d$ under Assumption (P), 

$$(1 + 1/\gamma)\left(2L_d + \ln(1+\gamma)\right) \geq C(\gamma)\left(1 + \sqrt{2L_d}\right)^2,$$

where $C(\gamma) > 1$. One can then apply Theorem 2 of Birgé & Massart (2001) to get 

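$$\sum_{j=1}^m E\|\hat\mu_j - \mu_j\|_2^2 \leq c_1(\gamma)\min_{J_0 \subseteq \{1,\ldots,m\}}\left\{\sum_{j \in J_0}\min_{1 \leq h_j \leq n}\left(\sum_{i=h_j+1}^n \mu_{(i)j}^2 + \mathrm{Pen}_j(h_j)\right) + \sum_{j \in J_0^c}\sum_{i=1}^n \mu_{ij}^2 + \mathrm{Pen}_0(m_0)\right\} + c_2(\gamma)\sigma_n^2(1 - \pi_0(0)) \qquad (19)$$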

□ 



Proof of Theorem 2 

One can easily check from Table 1 that for $\eta_{jn} > n^{-1/\min(p_j,2)}\sqrt{\ln n}$, the last 
term $c_2(\gamma)\sigma_n^2(1 - \pi_0(0))$ in the RHS of (13) is of order 
$O(\sigma_n^2) = o(R(\Theta_j[\eta_{jn}]))$ for all nonzero $\mu_j$ and all $p_j > 0$. 

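Let $J_0^*$ be the true (unknown) subset of nonzero $\mu_j$'s and $m_0 = |J_0^*|$. 

I. $m_0 \leq \lfloor m/e \rfloor$. 

Apply Theorem 1 for $J_0 = J_0^*$: 

$$\sum_{j=1}^m E\|\hat\mu_j - \mu_j\|_2^2 \leq c_1(\gamma)\left\{\sum_{j \in J_0^*}\min_{1 \leq h_j \leq n}\left(\sum_{i=h_j+1}^n \mu_{(i)j}^2 + 2\sigma_n^2(1 + 1/\gamma)\ln\left(\pi_j^{-1}(h_j)\binom{n}{h_j}(1+\gamma)^{h_j/2}\right)\right) + 2\sigma_n^2(1 + 1/\gamma)\ln\left(\pi_0^{-1}(m_0)\binom{m}{m_0}\right)\right\} + c_2(\gamma)\sigma_n^2(1 - \pi_0(0))$$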



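Since for $m_0 = 1, \ldots, \lfloor m/e \rfloor$, $\binom{m}{m_0} \leq (m/m_0)^{2m_0}$ (see Lemma A1 
of Abramovich et al., 2010), the required conditions on $\pi_0(\cdot)$ ensure that 

$$2\sigma_n^2(1 + 1/\gamma)\ln\left(\pi_0^{-1}(m_0)\binom{m}{m_0}\right) \leq C(\gamma)\sigma_n^2 m_0\ln(m/m_0)$$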
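
The binomial bound $\binom{m}{m_0} \leq (m/m_0)^{2m_0}$ for $m_0 \leq \lfloor m/e \rfloor$ is easy to confirm numerically. The following check is only a sanity test on a finite range of $m$ (the function name and the range are arbitrary choices), not a substitute for the cited lemma.

```python
import math

def binom_bound_holds(m):
    """Check ln C(m, m0) <= 2*m0*ln(m/m0) for every m0 = 1, ..., floor(m/e)."""
    for m0 in range(1, int(m / math.e) + 1):
        lhs = math.log(math.comb(m, m0))
        rhs = 2 * m0 * math.log(m / m0)
        if lhs > rhs + 1e-12:  # small tolerance for floating-point rounding
            return False
    return True

print(all(binom_bound_holds(m) for m in range(3, 200)))
```

Working on the log scale avoids overflow in $(m/m_0)^{2m_0}$ and matches the way the bound enters the penalty.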
To complete the proof for this case we now consider separately 

$$\min_{1 \leq h_j \leq n}\left\{\sum_{i=h_j+1}^n \mu_{(i)j}^2 + 2\sigma_n^2(1 + 1/\gamma)\ln\left(\pi_j^{-1}(h_j)\binom{n}{h_j}(1+\gamma)^{h_j/2}\right)\right\} \qquad (20)$$

for each $j \in J_0^*$ and show that it is $O(R(\Theta_j[\eta_{jn}]))$ (see Table 1). We distinguish 
between several cases, where the proofs for strong $l_p$-balls follow immediately from the proofs 
for the corresponding weak $l_p$-balls due to the embedding properties mentioned in Section 3. 

Case 1: $\mu_j \in \Theta_j[\eta_{jn}]$, $\eta_{jn} > e^{-c(\gamma)}$ for $p_j = 0$ and 
$\eta_{jn}^{p_j} > e^{-c(\gamma)}$ for $p_j > 0$. Taking $h^* = n$ implies, under the condition on 
$\pi_j(n)$, that (20) is $O(\sigma_n^2 n) = O(R(\Theta_j[\eta_{jn}]))$. 

Case 2: $\mu_j \in l_0[\eta_{jn}]$, $\eta_{jn} \leq e^{-c(\gamma)}$. Note that since 
$\mu_j \neq 0$, $\eta_{jn} \geq n^{-1}$. Choose $h^* = n\eta_{jn}$ and repeat the arguments of the 
proof of Theorem 3 of Abramovich, Grinshtein & Pensky (2007), using the slightly more general 
Lemma A1 of Abramovich et al. (2010) for approximating the binomial coefficient in (20) instead of 
their original Lemma A.1. 

Case 3: $\mu_j \in m_{p_j}[\eta_{jn}]$, $0 < p_j < 2$, 
$n^{-1}(\ln n)^{p_j/2} \leq \eta_{jn}^{p_j} \leq e^{-c(\gamma)}$. Take 
$1 \leq h^* = n\eta_{jn}^{p_j}(\ln \eta_{jn}^{-p_j})^{-p_j/2} \leq ne^{-c(\gamma)}$ and follow the 
proof of Theorem 4 of Abramovich, Grinshtein & Pensky (2007) with a more general version of 
Lemma A1 (see Case 2). 

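Case 4: $\mu_j \in m_{p_j}[\eta_{jn}]$, $p_j \geq 2$, 
$n^{-p_j/2}(\ln n)^{p_j/2} \leq \eta_{jn}^{p_j} \leq e^{-c(\gamma)}$. Take $h^* = 1$. Then, for 
$p_j > 2$, 

$$\sum_{i=2}^n \mu_{(i)j}^2 \leq \sigma_n^2 n^{2/p_j}\eta_{jn}^2\int_1^n x^{-2/p_j}\,dx = O(\sigma_n^2 n\eta_{jn}^2)$$

and, similarly, for $p_j = 2$, 

$$\sum_{i=2}^n \mu_{(i)j}^2 \leq \sigma_n^2 n\eta_{jn}^2\int_1^n x^{-1}\,dx = \sigma_n^2 n\eta_{jn}^2\ln n$$

On the other hand, under the conditions on $\pi_j(\cdot)$, $\pi_j(1) \geq n^{-c}$, which yields 

$$2\sigma_n^2(1 + 1/\gamma)\ln\left(\pi_j^{-1}(1)\,n\sqrt{1+\gamma}\right) = O(\sigma_n^2\ln n) = O(\sigma_n^2 n\eta_{jn}^2)$$

for $\eta_{jn} \geq \sqrt{n^{-1}\ln n}$. 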
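II. $\lfloor m/e \rfloor < m_0 \leq m$. 

Apply Theorem 1 for $J_0 = \{1, \ldots, m\}$ (or, equivalently, $J_0^c = \emptyset$) and $h_j = 1$ 
for $j \notin J_0^*$: 

$$\sum_{j=1}^m E\|\hat\mu_j - \mu_j\|_2^2 \leq c_1(\gamma)\left\{\sum_{j \in J_0^*}\min_{1 \leq h_j \leq n}\left(\sum_{i=h_j+1}^n \mu_{(i)j}^2 + \mathrm{Pen}_j(h_j)\right) + \sum_{j \notin J_0^*}\mathrm{Pen}_j(1) + \mathrm{Pen}_0(m)\right\} + c_2(\gamma)\sigma_n^2(1 - \pi_0(0)), \qquad (21)$$

where the conditions on $\pi_j(1)$ and $\pi_0(m)$ imply 
$\sum_{j \notin J_0^*}\mathrm{Pen}_j(1) = O(\sigma_n^2 m\ln n)$ and 
$\mathrm{Pen}_0(m) = O(\sigma_n^2 m)$. From Table 2 one can verify that for all dense and sparse 
cases, $\sigma_n^2\ln n = O(R(\Theta_j[\eta_{jn}]))$, $j \in J_0^*$, and, therefore, the first term 
(the sum over $j \in J_0^*$) in the RHS of (21) is dominating for $m_0 \sim m$. 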
□ 

Proof of Theorems 3-4 

The ideas of the proofs of both theorems on the minimax lower bounds are similar and can be 
combined. 

Note first that no estimator can perform better than an oracle who knows the true $J_0$. In 
this (ideal) case one would obviously set $\hat\mu_j = 0$ for all $j \in J_0^c$ with zero risk and, 
therefore, due to the additivity of the risk function, 

$$\inf_{\tilde\mu}\ \sup_{\mu_j \in \Theta_j[\eta_{jn}],\, j \in J_0}\ \sum_{j=1}^m E\|\tilde\mu_j - \mu_j\|_2^2 \geq C\sum_{j \in J_0} R(\Theta_j[\eta_{jn}])$$

for any $\Theta_j[\eta_{jn}]$ (see, e.g., Johnstone, 2011, Proposition 4.14). 

Furthermore, following Case II in the proof of Theorem 2, $\sum_{j \in J_0} R(\Theta_j[\eta_{jn}])$ 
dominates over $\sigma_n^2 m_0\ln(m/m_0)$ in (15) and (16) for $m_0 > m/2$. To complete the proof we 
need to show, therefore, that for $m_0 \leq m/2$, the minimal unavoidable price for not being an 
oracle for selecting nonzero $\mu_j$'s is of order $\sigma_n^2 m_0\ln(m/m_0)$. 

The main idea of the proof is to find a subset $\mathcal{M}_{m_0}$ of $n \times m$ vectors 
$\mu = (\mu_{11}, \ldots, \mu_{n1}, \ldots, \mu_{1m}, \ldots, \mu_{nm})'$ with $m_0$ nonzero 
$\mu_j = (\mu_{1j}, \ldots, \mu_{nj})' \in \Theta_j[\eta_{jn}]$ such that for any pair 
$\mu^1, \mu^2 \in \mathcal{M}_{m_0}$ and some $C > 0$, 
$\|\mu^1 - \mu^2\|_2^2 \geq C\sigma_n^2 m_0\ln(m/m_0)$, while the Kullback-Leibler divergence 
$K(P_{\mu^1}, P_{\mu^2}) = \|\mu^1 - \mu^2\|_2^2/(2\sigma_n^2) \leq (1/16)\ln\mathrm{card}(\mathcal{M}_{m_0})$. 
The result will then follow immediately from Lemma A.1 of Bunea, Tsybakov & Wegkamp (2007). 

Define the subset $\mathcal{D}_{m_0}$ of all $m$-dimensional indicator vectors with $m_0$ entries of 
ones, that is, $\mathcal{D}_{m_0} = \{d : d \in \{0,1\}^m, \|d\|_0 = m_0\}$. By Lemma A.3 of 
Rigollet & Tsybakov (2011), for $m_0 \leq m/2$ there exists a subset 
$\tilde{\mathcal{D}}_{m_0} \subset \mathcal{D}_{m_0}$ such that for some constant $c > 0$, 
$\ln\mathrm{card}(\tilde{\mathcal{D}}_{m_0}) \geq c\,m_0\ln(m/m_0)$, and for any pair 
$d_1, d_2 \in \tilde{\mathcal{D}}_{m_0}$, the Hamming distance 
$\rho(d_1, d_2) = \sum_{j=1}^m I\{d_{1j} \neq d_{2j}\} \geq c\,m_0$. 
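
To give a feeling for such well-separated subsets, the sketch below builds one greedily for small $m$ and $m_0$. This brute-force construction is an illustration only and is not the argument used in the cited lemma; the function names and the distance threshold are assumptions of the example.

```python
import itertools

def hamming(d1, d2):
    """Number of coordinates in which two indicator vectors differ."""
    return sum(a != b for a, b in zip(d1, d2))

def greedy_packing(m, m0, min_dist):
    """Greedily collect indicator vectors with m0 ones and pairwise Hamming distance >= min_dist."""
    chosen = []
    for ones in itertools.combinations(range(m), m0):
        d = tuple(1 if j in ones else 0 for j in range(m))
        if all(hamming(d, e) >= min_dist for e in chosen):
            chosen.append(d)
    return chosen

# With m0 = 2 ones, distance 4 means the two supports are disjoint.
packing = greedy_packing(m=8, m0=2, min_dist=4)
for d in packing:
    print(d)
```

The greedy search scans all $\binom{m}{m_0}$ candidates, so it is feasible only for small dimensions, whereas the lemma guarantees a subset whose log-cardinality is of order $m_0\ln(m/m_0)$ for any $m_0 \leq m/2$.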

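To any indicator vector $d \in \tilde{\mathcal{D}}_{m_0}$ assign the corresponding mean vector 
$\mu \in \mathcal{M}_{m_0}$ as follows. Let $\tilde C^2 = (1/16)\sigma_n^2 c\ln(m/m_0)$. Define 
$\mu_j = (\tilde C, 0, \ldots, 0)'I\{d_j = 1\}$ for $0 \leq p_j < 2$ and 
$\mu_j = (\tilde C n^{-1/2}, \tilde C n^{-1/2}, \ldots, \tilde C n^{-1/2})'I\{d_j = 1\}$ for 
$p_j \geq 2$, $j = 1, \ldots, m$. Hence, 
$\mathrm{card}(\mathcal{M}_{m_0}) = \mathrm{card}(\tilde{\mathcal{D}}_{m_0})$. Obviously, the 
resulting $\mu_j \in l_0[\eta_{jn}]$ and a straightforward calculus shows that under the additional 
constraint on $\eta_{jn}$ and $m_0$ in Theorem 4, $\mu_j \in l_{p_j}[\eta_{jn}]$. 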

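For any $\mu^1, \mu^2 \in \mathcal{M}_{m_0}$ and the corresponding 
$d_1, d_2 \in \tilde{\mathcal{D}}_{m_0}$, we then have 

$$\|\mu^1 - \mu^2\|_2^2 = \tilde C^2\sum_{j=1}^m I\{d_{1j} \neq d_{2j}\} \geq \tilde C^2 c\,m_0 = (1/16)\sigma_n^2 c^2 m_0\ln(m/m_0)$$

and 

$$K(P_{\mu^1}, P_{\mu^2}) = \frac{\tilde C^2}{2\sigma_n^2}\sum_{j=1}^m I\{d_{1j} \neq d_{2j}\} \leq \frac{\tilde C^2}{\sigma_n^2}m_0 \leq (1/16)\ln\mathrm{card}(\mathcal{M}_{m_0})$$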

□ 
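
The two displayed relations in this construction rest on the fact that every nonzero $\mu_j$ has the same squared norm, so the squared distance between two mean vectors is that common squared norm times the Hamming distance of their indicator vectors. A minimal numerical check, with hypothetical values for $n$, $\sigma_n^2$ and the amplitude (called `C_tilde` here), might look as follows.

```python
import math

def mean_vectors(d, n, C_tilde, dense):
    """Map an indicator vector d to a list of mean vectors, each of squared norm C_tilde^2 if d_j = 1."""
    if dense:
        block = [C_tilde / math.sqrt(n)] * n   # energy spread over all n components
    else:
        block = [C_tilde] + [0.0] * (n - 1)    # all energy in a single component
    return [[v * dj for v in block] for dj in d]

def sq_dist(mu1, mu2):
    """Squared Euclidean distance between two collections of mean vectors."""
    return sum((a - b) ** 2 for r1, r2 in zip(mu1, mu2) for a, b in zip(r1, r2))

def distance_and_kl(d1, d2, n, C_tilde, sigma2, dense):
    """Return (squared distance, Kullback-Leibler divergence) for Gaussian noise of variance sigma2."""
    dist = sq_dist(mean_vectors(d1, n, C_tilde, dense), mean_vectors(d2, n, C_tilde, dense))
    return dist, dist / (2.0 * sigma2)

# Hypothetical toy values.
n, sigma2, C_tilde = 4, 1.0, 0.5
d1, d2 = (1, 1, 0, 0, 0), (0, 1, 1, 0, 0)
rho = sum(a != b for a, b in zip(d1, d2))  # Hamming distance

for dense in (False, True):
    print(dense, distance_and_kl(d1, d2, n, C_tilde, sigma2, dense), C_tilde ** 2 * rho)
```

Both the sparse and the dense variant give the same distance, which is why the lower-bound argument does not depend on which of the two is used for a given $p_j$.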